We characterized and evaluated the functional attributes of three yeast high-confidence protein-protein interaction data sets derived from affinity purification/mass spectrometry, protein-fragment complementation assay, and yeast two-hybrid experiments. The interacting proteins retrieved from these data sets formed distinct, partially overlapping sets with different protein-protein interaction characteristics. These differences were primarily a function of the experimental technologies used to recover these interactions. This affected the total coverage of interactions and was especially evident in the recovery of interactions among different functional classes of proteins. We found that the interaction data obtained by the yeast two-hybrid method were the least biased toward any particular functional characterization. In contrast, interacting proteins in the affinity purification/mass spectrometry and protein-fragment complementation assay data sets were over- and under-represented among distinct and different functional categories. We delineated how these differences affected protein complex organization in the network of interactions, in particular for strongly interacting complexes (e.g. RNA and protein synthesis) versus weak and transient interacting complexes (e.g. protein transport). We quantified methodological differences in detecting protein interactions from larger protein complexes, in the correlation of protein abundance among interacting proteins, and in the connectivity of essential proteins. In the latter case, we showed that minimizing inherent methodology biases removed many of the ambiguous conclusions about protein essentiality and protein connectivity. We used these findings to rationalize how biological insights obtained by analyzing data sets originating from different sources sometimes do not agree or may even contradict each other. An important corollary of this work was that discrepancies in biological insights did not necessarily imply that one detection methodology was better or worse, but rather that, to a large extent, the insights reflected the methodological biases themselves. Consequently, interpreting the protein interaction data within their experimental or cellular context provided the best avenue for overcoming biases and inferring biological knowledge. Molecular & Cellular Proteomics 10: 10.1074/mcp.M111.012500, 1-17, 2011.

The collection of proteins and protein assemblies in a cell constitutes a vital and integral part of the machinery required to sustain all cellular functions and processes (1). Given that most proteins are part of one or more protein complexes, protein-protein interactions are essential in understanding the nature of protein-mediated biological processes. Because of the large number of potential protein interactions, high-throughput technologies are essential for generating whole-cell maps of these interactions (2, 3). Several large-scale protein interaction data sets of the yeast Saccharomyces cerevisiae have been determined using different high-throughput technologies, namely the following: (1) affinity purification followed by mass spectroscopy (AP/MS)¹ (4-7), (2) protein-fragment complementation assay (PCA) (8), and (3) yeast two-hybrid (Y2H) (9-11). Each approach detects and reports interactions in a distinct manner. The Y2H and PCA techniques detect binary interactions, whereas the AP/MS techniques purify and identify protein complexes. All three methods at some point rely on modified protein constructs to identify protein interactions. For example, AP/MS uses tagged bait proteins to bind to prey proteins in the native cellular environment, followed by affinity purification and mass spectrometry detection of proteins, whereas both Y2H and PCA rely on separate protein complementation schemes to ultimately report on whether a protein pair is interacting. In addition, whereas the AP/MS and PCA methods identify interactions at approximately physiological cellular protein concentrations, the concentrations of the interacting partners in the Y2H screens are not necessarily comparable to those found in the native environment. Furthermore, the Y2H method requires that interacting partners be present in the nucleus for their interaction to be detectable.

¹ The abbreviations used are: AP/MS, affinity purification followed by mass spectroscopy; BGS, binary gold standard; IDBOS, interactions detected based on shuffling; MIPS, Munich Information Center for Protein Sequences; PCA, protein-fragment complementation assay; Y2H, yeast two-hybrid.

The reliability of each technique has been extensively reviewed in the literature and comprehensive analyses have often resulted in contrasting conclusions (8, 11-17). For example, the overlap of Y2H screens by different laboratories is often small (18), suggesting high false-negative rates, whereas AP/MS screens infer a substantial fraction of indirect interactions (11), suggesting high false-positive rates. However, it is generally accepted that any measure of reliability is not absolute and largely dependent on the nature of the selected gold standard reference set (11, 19). Several studies have attempted to identify subsets of high-confidence interactions in the raw AP/MS (6, 7, 14, 20, 21) and the Y2H data (11). To date, only one comprehensive PCA data set exists for yeast (8), limiting assessments of interlaboratory variability and reproducibility of this method (22). Recently, Yu and colleagues consolidated three Y2H data sets into a single high-confidence set and showed that this set is more enriched with interactions found in the manually curated binary gold standard (BGS) data set than the combined set from two AP/MS studies (11). More recently, our group developed a novel approach to score pair-wise protein associations derived from AP/MS data sets (14). This procedure, termed interaction detection based on shuffling (IDBOS), computes a co-occurrence significance score for two proteins by comparing the number of times they are experimentally observed to co-purify with those obtained from commensurate randomized simulations. The resulting data set identifies binary interactions as well as, or better than, the high-confidence consolidated Y2H set and previous high-confidence data sets based on AP/MS purifications. These results are of particular importance because, unlike previous studies (7, 20), the IDBOS procedure, which generates the high-confidence AP/MS set, is a purely numerical approach that requires no training set or machine learning, resulting in data sets that are less likely to be biased by previous knowledge. Topological analyses reveal stark differences with respect to the modularity of the networks between different data sets; in particular, our high-confidence AP/MS network exhibits densely connected regions of proteins (14) indicative of functional modules (23-25). Here, the IDBOS-derived high-confidence data set is compared and contrasted with the consolidated Y2H and the PCA high-confidence data sets.
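The idea of scoring co-purification against randomized simulations can be illustrated with a minimal sketch. This is not the published IDBOS implementation: the shuffling scheme (re-dealing all protein occurrences into purifications of the original sizes), the z-score form of the significance score, and all function names are assumptions made for illustration only.

```python
import random
from statistics import mean, pstdev


def cooccurrence(purifications, a, b):
    """Count purifications that contain both proteins a and b."""
    return sum(1 for p in purifications if a in p and b in p)


def shuffle_purifications(purifications, rng):
    """Re-deal all protein occurrences into purifications of the original
    sizes, keeping each protein's total number of occurrences fixed.
    Duplicates that land in one purification collapse (a simplification)."""
    pool = [prot for p in purifications for prot in sorted(p)]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for p in purifications:
        shuffled.append(set(pool[start:start + len(p)]))
        start += len(p)
    return shuffled


def shuffle_z_score(purifications, a, b, n_shuffles=1000, seed=0):
    """Hypothetical significance score: observed co-occurrence count
    compared with its distribution over shuffled purifications."""
    rng = random.Random(seed)
    observed = cooccurrence(purifications, a, b)
    null = [cooccurrence(shuffle_purifications(purifications, rng), a, b)
            for _ in range(n_shuffles)]
    spread = pstdev(null)
    if spread == 0:
        return float("inf") if observed > mean(null) else 0.0
    return (observed - mean(null)) / spread


if __name__ == "__main__":
    # Toy purifications (sets of co-purified proteins); names are made up.
    toy = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "B", "E"},
           {"C", "F"}, {"D", "G"}, {"E", "H"}, {"F", "G", "H"}]
    print(shuffle_z_score(toy, "A", "B"), shuffle_z_score(toy, "A", "F"))
```

A pair that co-purifies more often than the shuffled purifications would suggest receives a large positive score; thresholding such a score is one conceivable way to define a high-confidence subset without a training set.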

The apparent dependence of the protein interaction data or network on the detection methodology raises two fundamental questions: what are the different methods actually detecting and how does this influence the downstream analyses and interpretation of the data as protein interaction data? The core question of whether two proteins bind together is a thermodynamic question that involves the binding free energy associated with the bound protein complex as opposed to two infinitely separated, noninteracting proteins (26-29). This quantity relates to the ability of two proteins to bind. Even if this information were available, the chemical and biological state of a cell would dynamically determine if binding actually takes place (30). The uncertainty increases if we probe this interaction via an experimental technique that affects the chemical and biological state of the cell as well as the proteins' microenvironment. Therefore, given that the results are not independent of the experimental techniques and that each experimental technique yields a completely different set of interactions, how can one interpret the data in any meaningful fashion? The most general approach is to keep each data source separate and identify biological insights captured by all methods. However, this approach is hampered by a lack of sufficient overlap between the existing data sets and, hence, could still produce contradictory conclusions about the biology associated with or derived from the underlying interaction data.
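For reference, the binding free energy mentioned above is related to the equilibrium constants of a bimolecular association A + B ⇌ AB by the standard textbook expression below. The symbols are not defined in the original text and are introduced here as assumptions: K_a is the dimensionless association constant, K_d the dissociation constant, c° the standard concentration (1 M), R the gas constant, and T the absolute temperature.

```latex
\Delta G^{\circ}_{\mathrm{bind}}
  = -RT \ln K_a
  = RT \ln \frac{K_d}{c^{\circ}},
\qquad
K_d = \frac{[\mathrm{A}]\,[\mathrm{B}]}{[\mathrm{AB}]} = \frac{c^{\circ}}{K_a}
```

A more negative binding free energy corresponds to a smaller K_d, i.e. tighter binding, but as the text notes, this equilibrium quantity alone does not determine whether binding occurs under given cellular conditions.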

Herein, we have analyzed and compared three categories of high-confidence high-throughput protein interaction data sets to highlight the apparent differences in biological content associated with these data sets. The high-confidence nature of these data sets ensures that the experimental uncertainty associated with large genomic-scale interaction screens is minimized and allows us to focus on the inherent methodological differences. We found that there were marked differences in terms of which proteins were retrieved from these screens, how their interactions were distributed among different functional classes of proteins, and, in particular, that there were strong methodological biases for the retrieval of proteins belonging to small versus large interaction complexes. We mapped known Munich Information Center for Protein Sequences (MIPS) (31, 32) protein complexes onto each high-confidence protein-protein interaction network to highlight the distinct higher-order organization between functional components in the data sets. These differences were partly reflective of the experimental technology and were germane to the downstream analyses and interpretations of each high-confidence network in terms of correlation of protein abundance among interacting protein pairs and the location of essential proteins in the composite protein interaction network. In the latter case, we derived a consensus analysis that minimized the experimental biases and showed that the essentiality-connectivity correlation was present in these data sets.

In summary, we quantified the differences between three high-confidence protein-protein interaction networks in yeast and showed how the different methodologies affect the biological interpretation of the data. Specific interactions and conclusions derived from selected protein interactions are, at the current state of knowledge and experimental capacity, strongly tied to the underlying experimental platforms and, hence, comparisons of biological insights derived from data sets with different biological characteristics may be contradictory unless the underlying experimental biases themselves are accounted for.
Table I

Annotation coherence among interacting proteins for selected protein interaction data sets. For each top-level annotation class of “Function,” “Location,” and “Complex” in the Munich Information Center for Protein Sequences, we classified an interaction as intra-annotation if both proteins were annotated and shared at least one common annotation item, or inter-annotation if both proteins were annotated but no common annotation was shared. The sum of the intra- and inter-annotation counts can add up to less than the total number of interactions due to cases where at least one protein lacks any annotation. The number in parentheses gives the percentage of the total number of interactions in Column 3. The Methods Section provides descriptions and characteristics of the separate data sets. Abbreviations: AP/MS, affinity purification/mass spectrometry; PCA, protein-fragment complementation assay; Y2H, yeast two-hybrid.

                 Number of  Number of      Function                  Location                   Complex
Data set         proteins   interactions   Intra        Inter        Intra         Inter        Intra        Inter

Manually curated binary interaction
  BGS            1061       1239           1176 (95%)   41 (3%)      1092 (88%)    83 (7%)      521 (42%)    102 (8%)

High-confidence high-throughput
  AP/MS          1274       7879           6722 (85%)   1007 (13%)   6914 (88%)    626 (8%)     2462 (31%)   819 (10%)
  PCA            1076       2530           1315 (52%)   732 (29%)    1802 (71%)    530 (21%)    161 (6%)     109 (4%)
  Y2H            1962       2703           1161 (43%)   834 (31%)    1691 (63%)    580 (21%)    148 (6%)     113 (4%)

Raw high-throughput
  AP/MS          2551       18,043         9537 (53%)   7315 (40%)   12,486 (69%)  5048 (28%)   1393 (8%)    2731 (15%)

RESULTS  AND  DISCUSSION 

We investigated three high-confidence high-throughput protein-protein interaction data sets (AP/MS, PCA, and Y2H), an unfiltered, raw interaction data set (raw-AP/MS), and the manually curated BGS set, which focuses on binary protein interactions. The Methods Section describes the selection and salient features of these data sets, whereas columns 1-3 of Table I summarize the number of proteins and interactions contained in these data sets. Herein, we first discuss the classification of detected proteins and their interactions based on the detection methodology, and then we highlight the impact of the observed differences in the biological properties of the interacting proteins.

Classification  of  Detected  Proteins  and  Their  Interactions 

Functional Diversity in Protein Interaction Data Sets— Although genomic-scale protein-protein interaction detection campaigns are by design intended to probe as many interactions as possible, it is well known that the retrieved interactions rarely overlap. Even for the high-confidence data sets analyzed herein, the overlaps between interactions in the AP/MS:Y2H, AP/MS:PCA, and Y2H:PCA sets in Table I were 175, 182, and 66, respectively. Out of all 13,112 high-confidence interactions in Table I, only 18 interactions (0.1%) were common among all three sets. This was despite a relatively larger partial overlap between the constituent proteins among the three data sets, with overlaps between the AP/MS:Y2H, AP/MS:PCA, and Y2H:PCA data sets in Table I of 545, 357, and 440, respectively, and 182 proteins (4.2%) common among all three sets. Although the overlap was larger than for the interactions themselves, each protein set was quite distinct and a functional classification of these protein sets allowed us to generalize the different characteristics between the data sets. Fig. 1A shows the relative distribution of functional classes for all S. cerevisiae proteins for ten selected MIPS functional classes (labeled “Expected”) and the difference from these values for the interacting proteins from the three high-confidence data sets, AP/MS, PCA, and Y2H. Across all 17 MIPS functional classes, the Y2H and PCA data sets showed the smallest deviation between the measured and “Expected” values with root mean squared differences of 0.03 and 0.04, respectively, compared with 0.09 for the AP/MS data. We found the largest deviations in the AP/MS data set associated with under-representation of metabolic proteins and cell rescue, and over-representations in the categories of transcription, protein synthesis, and protein binding. For the PCA data set, the largest under-representation was of cell-cycle proteins and protein synthesis proteins, and the largest over-representation was of cellular transport proteins. If the selection of tested proteins is unbiased, these differences should reflect the unavoidable experimental biases associated with each detection method. For example, the under-representation of metabolic proteins in the AP/MS data can be rationalized by the fact that metabolic proteins, in general, do not function in large complexes or bind strongly to other proteins (33); hence, it is expected that the affinity purification step will most likely result in loss of these proteins. Similarly, proteins that function in larger, tightly organized assemblies involved in transcription or protein synthesis would be preferentially included in the AP/MS data set, whereas the restriction imposed on the PCA and Y2H reporter systems would not favor these protein classes.
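The root mean squared difference between the measured and “Expected” functional-class distributions quoted above can be computed as in the following sketch. How the class fractions were normalized in the original analysis is not stated in this section, so the normalization used here (class counts divided by their sum) and the function names are assumptions.

```python
import math


def class_fractions(class_counts):
    """Convert per-class protein counts into relative fractions."""
    total = sum(class_counts.values())
    return {cls: n / total for cls, n in class_counts.items()}


def rms_difference(observed, expected):
    """Root mean squared difference between two class-fraction mappings.
    Classes missing from one mapping are treated as having fraction 0."""
    classes = sorted(set(observed) | set(expected))
    squared = [(observed.get(c, 0.0) - expected.get(c, 0.0)) ** 2
               for c in classes]
    return math.sqrt(sum(squared) / len(classes))


if __name__ == "__main__":
    # Made-up counts for two classes, for illustration only.
    expected = class_fractions({"ME": 50, "TR": 50})
    observed = class_fractions({"ME": 30, "TR": 10})
    print(rms_difference(observed, expected))
```

Under this definition, a data set whose class fractions track the proteome-wide fractions closely yields a value near zero, which is the sense in which the Y2H and PCA sets (0.03 and 0.04) deviate less than the AP/MS set (0.09).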

The number of detected interactions each protein has with other proteins was strongly dependent on the functional class membership and detection methodology. Fig. 1B shows the average number of interactions (or degree) distributed among the same functional categories as in Fig. 1A for the three data sets. There are large and clear differences between the AP/MS data set on the one hand and the PCA and Y2H data sets on the other hand, particularly for proteins involved in cell cycle, transcription, protein synthesis, protein fate, protein binding, and cell-component biogenesis. Fig. 1C shows the relative distribution of these interactions among all detected interactions for each experimental technique. These relative distributions indicate that the data sets contained clearly differentiated sets of interactions distributed among varying functional classes. Thus, the AP/MS interaction data were skewed toward proteins in transcription (23%) and protein synthesis (15%), the PCA data were skewed toward proteins in cellular transport (16%) and metabolism (13%), and the two largest groups in the Y2H data were proteins in metabolism (12%) and protein fate (12%). This confirms that the different high-confidence interaction data sets were associated with proteins in varying functional classes whose proteins had a distinct number of interactions, resulting in a unique distribution of interactions among specific protein sets.

Fig. 1. Functional diversity among proteins and interactions present in the high-confidence data sets. A, The relative distribution of all proteins from Saccharomyces cerevisiae, as annotated according to the Munich Information Center for Protein Sequences functional categories, is shown by the gray line labeled “Expected.” It denotes the relative fraction (coverage) of all yeast proteins that belong to a given category. The deviations of the AP/MS, PCA, and Y2H high-confidence data sets from this distribution are shown by the different colored bars, e.g. proteins labeled “Metabolism” are relatively under-represented in the AP/MS data set. B, The average degree of the proteins in a given functional category for each high-confidence data set. C, The relative distribution of protein-protein interactions in the different functional categories for each high-confidence data set. Legend categories: Metabolism (ME), Cell cycle (CC), Transcription (TR), Protein synthesis (PS), Protein fate (PF), Protein binding (PB), Cellular transport (CT), and Other.

Fig. 2. High-confidence data set coverage of the protein-protein interaction matrix. We have mapped each interaction in the AP/MS, PCA, and Y2H data sets to the yeast proteome protein-protein interaction matrix defined by all possible binary interactions. The proteins were ordered according to their Munich Information Center for Protein Sequences functional categories. We have indicated the location of proteins belonging to the categories of metabolism (ME), cell cycle (CC), transcription (TR), protein synthesis (PS), protein fate (PF), protein binding (PB), and cellular transport (CT). We have enlarged the symbols of each interaction to make the differences among the data sets and functional categories more visible. The different methodologies retrieved different interaction sets influenced by the underlying experimental platform, e.g. AP/MS recovered tightly bound protein complexes associated with transcription, protein synthesis, and protein binding, whereas PCA recovered many more weakly bound interactions, e.g. those involved in cellular transport.
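The average degree per functional category discussed in connection with Fig. 1B can be obtained from an interaction list and a protein-to-category annotation, as in the sketch below. It assumes each interaction is listed once as an unordered pair without self-interactions, and that a protein annotated with several categories contributes to each of them; both are illustrative assumptions, not details taken from the original analysis.

```python
from collections import defaultdict


def degrees(interactions):
    """Number of interaction partners per protein (interactions are
    assumed to be unique unordered pairs without self-interactions)."""
    degree = defaultdict(int)
    for a, b in interactions:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


def average_degree_by_category(interactions, categories):
    """Mean degree of the interacting proteins in each category.
    `categories` maps a protein to a set of category labels."""
    totals, counts = defaultdict(int), defaultdict(int)
    for protein, degree in degrees(interactions).items():
        for category in categories.get(protein, ()):
            totals[category] += degree
            counts[category] += 1
    return {category: totals[category] / counts[category]
            for category in totals}


if __name__ == "__main__":
    # Hypothetical toy network and annotations.
    toy_interactions = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
    toy_categories = {"a": {"TR"}, "b": {"TR"}, "c": {"PS"}, "d": {"PS", "TR"}}
    print(average_degree_by_category(toy_interactions, toy_categories))
```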

Fig. 2 shows the projection of the three high-confidence data sets on the protein-protein interaction matrix encompassing all possible binary interactions that can be formed based on the yeast proteome. The proteins are ordered according to their MIPS functional categorization and follow the same order as in Fig. 1, i.e. metabolism, cell cycle, etc. It is well known that the sparsity of the matrix is due to the relatively low estimates of the yeast interactome (~10^5 unique protein-protein interactions) compared with the total number of possible pair-wise interactions (~10^7) (34, 35). The interactions mapped out in Fig. 2 show that the different data sets covered distinct parts of the interaction space, with some functional categories relatively well covered whereas others were disproportionately sparse. This observation is due partly to biological reasons and partly to the methodological differences in detecting the interactions. For example, we expected that direct interactions between cell cycle proteins (CC) and proteins involved in protein synthesis (PS) would be absent. We confirmed this expectation for all three experimental platforms. Similarly, the intersection between PS and protein fate (PF) should be, and was, relatively free from interactions in all three data sets. The methodological biases noted in Fig. 1 were also evident in the interaction mapping in Fig. 2; the functional categories of PS and protein binding (PB) were well represented by the AP/MS data set, whereas interactions among metabolic proteins (ME) were relatively sparse among all data sets. Fig. 2 also shows the high proportion of interactions within the cellular transport (CT) category associated with the PCA data set.

The protein-protein interaction matrix can also be used to gauge the level of influence biological functions have on the detected interactions. A randomly generated selection of interactions, on the same order of magnitude as the number of interactions among the high-confidence data sets (shown in Table I), will not show any preferential clustering among or between functional categories in the matrix (shown in Fig. 2). We characterized this property by calculating the distribution of distances between every pair of points within a data set in the matrix and comparing the distributions of the data sets (supplementary Fig. S1). This analysis showed that the interactions in the Y2H data set were distributed most similarly to those in the randomly selected interaction data compared with the AP/MS and PCA data sets. This did not indicate that the Y2H data were random; rather, it demonstrated that the interactions detected in this data set were not biased toward any particular functional set or classification (11). This implied that the experimental detection of these interactions was not influenced by the biological function of the tested proteins and, therefore, provided a markedly unbiased test of the proteins’ capability to interact. This leads to increased confidence in the ability of this data set to provide a foundation for a stricter thermodynamic evaluation of the proteins’ ability to interact under the given experimental conditions.
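The comparison of pairwise-distance distributions described above might be sketched as follows. Each interaction is treated as a point (i, j) in the matrix, where i and j are the indices of the two proteins in the functionally ordered protein list. The Euclidean metric, the histogram binning, and the total-variation measure used to compare two distributions are all assumptions; the original comparison is documented only in the supplementary material, which is not reproduced here.

```python
import itertools
import math
import random


def pairwise_distances(points):
    """Euclidean distances between every pair of matrix points (i, j)."""
    return [math.dist(p, q) for p, q in itertools.combinations(points, 2)]


def random_interactions(n_proteins, n_interactions, seed=0):
    """Uniformly sample unique protein pairs (i < j) as a random reference."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n_interactions:
        i, j = rng.sample(range(n_proteins), 2)
        chosen.add((min(i, j), max(i, j)))
    return sorted(chosen)


def histogram(values, bin_width, n_bins):
    """Normalized histogram; values beyond the last bin go into it."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v // bin_width), n_bins - 1)] += 1
    return [c / len(values) for c in counts]


def histogram_gap(h1, h2):
    """Total variation distance between two normalized histograms."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


if __name__ == "__main__":
    # Toy comparison: a tight cluster of points versus a random selection.
    clustered = [(i, j) for i in range(5) for j in range(i + 1, 6)]
    reference = random_interactions(n_proteins=100, n_interactions=15)
    h_clustered = histogram(pairwise_distances(clustered), 10.0, 15)
    h_reference = histogram(pairwise_distances(reference), 10.0, 15)
    print(histogram_gap(h_clustered, h_reference))
```

A data set whose histogram lies close to that of the random reference shows little functional clustering in the matrix, which is the property attributed to the Y2H set in the text. Note that the number of point pairs grows quadratically with the number of interactions.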

In contrast, native protein interactions are sensitive to local chemical environments, protein concentrations, regulatory and nonequilibrium processes, ATP levels, phosphorylation status, post-translational modifications, etc., which define the “natural” environment. Although one advantage of the AP/MS and PCA procedures is their probing of interactions in this natural environment, an unbiased selection of protein interactions similar to that of the Y2H methods was not achieved. Although this may provide an advantage in detecting interactions that actually occur in the cellular environment under the given experimental conditions, the lack of direct overlap between the AP/MS and PCA data sets highlights the sensitivity of these methods to experimentally specific cellular conditions. In the material below, we explore the consequences of these biases in selecting protein interactions in the three data sets.

Annotation Consistency of High-Confidence Data Sets— In order to characterize the retrieved protein sets (12, 36, 37), we further stratified the interaction data according to three different high-level MIPS annotation classes as follows: “Function,” “Location,” and “Complex.” Table I summarizes these analyses for a reference set of manually curated binary interactions (BGS), the three high-confidence protein interaction data sets, as well as for a raw high-throughput data set. For each of the annotation classes we have further subdivided the interaction data, based on whether interacting proteins share their annotations, into inter- and intra-annotation groups. For example, if both members of a protein-protein interaction pair belong to the same functional category of “metabolism,” we classified the pair as an intra-annotation pair; otherwise, we classified the pair as an inter-annotation pair. For the annotation class “Function,” we noted that the manually curated BGS data set was heavily skewed toward protein annotations belonging to the same versus different functional categories (95% versus 3%). The same holds true for the high-confidence AP/MS set (85% versus 13%) in sharp contrast to the corresponding raw data (53% versus 40%). In this case, the extraction of the high-confidence subset from the raw data resulted in recovering a higher proportion of protein interactions belonging to the same functional category. The similar fraction of intra- and inter-annotation pairs in the BGS and the high-confidence AP/MS data set was not caused by overlapping interactions between these sets, as only 340 of these interactions coincided. This implies that the AP/MS technique, although designed to detect complexes, actually possesses a large binary interaction character (14). Neither the PCA nor the Y2H data sets showed the same sharp distinction between intra- and inter-annotations of protein interaction pairs. Of course, the general statement that interacting proteins tend to belong to the same functional category holds, regardless of the methodology or any biases in selection of tested proteins. Similar relative observations also hold true for the annotation class “Location.” In the class “Complex,” which contains proteins known to be associated with specific protein complexes, the lack of a comprehensive set of protein annotations sharply reduced the number of protein interactions that we could classify, especially for the PCA and Y2H data sets. However, it was clear that both the reference standard (BGS) and the high-confidence AP/MS data sets were still enriched in intra-annotation, compared with inter-annotation, complex pairs.
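The intra-/inter-annotation rule used for Table I (both proteins annotated and sharing at least one item, versus both annotated with nothing in common, versus at least one protein unannotated) can be written down directly. The sketch below follows that rule; the function names and the toy annotations are hypothetical.

```python
def classify_interaction(a, b, annotations):
    """Classify one interaction as 'intra', 'inter', or 'unclassified'.
    `annotations` maps a protein to the set of its annotation items
    within one annotation class (e.g. "Function")."""
    items_a = annotations.get(a, set())
    items_b = annotations.get(b, set())
    if not items_a or not items_b:
        return "unclassified"  # at least one protein lacks any annotation
    return "intra" if items_a & items_b else "inter"


def annotation_coherence(interactions, annotations):
    """Count intra-, inter-, and unclassified interactions in a data set."""
    counts = {"intra": 0, "inter": 0, "unclassified": 0}
    for a, b in interactions:
        counts[classify_interaction(a, b, annotations)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical proteins and functional annotations.
    toy_annotations = {
        "p1": {"metabolism"},
        "p2": {"metabolism", "transcription"},
        "p3": {"protein synthesis"},
    }
    toy_interactions = [("p1", "p2"), ("p1", "p3"), ("p2", "p4")]
    print(annotation_coherence(toy_interactions, toy_annotations))
```

Dividing each count by the total number of interactions gives the percentages in Table I; for instance, 1176 intra-annotation pairs out of 1239 BGS interactions correspond to the reported 95%.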

The  observation  that  interacting  proteins  share  a  common 
function  indicates  that  the  interaction  data  itself  could  carry 
information  about  the  organization  of  functional  modules  (23— 
25).  Consistent  with  the  high  intra-annotation  fraction  in  the 
AP/MS  data  set,  Fig.  2  shows  the  densely  populated  diagonal 
of  the  protein  interaction  matrix.  The  AP/MS  interactions  data 
set  was  further  characterized  by  distinct,  densely  connected 
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Fig.  3.  Organization  of  interactions  and  biological  processes  in  the  AP/MS  data  set.  The  top  row  illustrates  the  effect  of  increasing  the 
fidelity  of  the  network  representation  by  decreasing  the  false  positive  rate  (FPR)  associated  with  the  interactions.  The  data  set  commensurate 
with  a  5%  false  positive  was  designated  as  the  high-confidence  AP/MS  data  set  in  this  work.  The  bottom  row  shows  the  projection  of  Munich 
Information  Center  for  Protein  Sequences  (MIPS)  high-level  annotations  color  coded  for  different  “Function,”  “Location,”  and  “Complex” 
categories.  We  assigned  each  protein  in  the  network  only  one  of  its  MIPS  function  annotation  item(s)  to  maximize  the  number  of  homogeneous 
interactions  of  the  network  using  a  Monte  Carlo  algorithm  (See  Methods).  We  also  outlined  selected  major  biological  processes  in  these 
interaction  maps.  The  complete  color  scheme  and  annotations  for  the  “Complex”  annotations  are  provided  as  an  interactive  and  viewable  map 
in  the  Supporting  Material. 


off-diagonal  regions,  reflective  of  the  high-confidence  nature 
of  this  data  set.  The  top  row  of  Fig.  3  illustrates  this  point  by 
showing  the  difference  in  connectivity  between  a  low-confi¬ 
dence  raw  network  and  higher  confidence  networks  (com¬ 
mensurate  with  20  and  5%  false  positive  rates  (See  Methods)) 
based  on  our  IDBOS-analysis  of  the  AP/MS  data.  The  high 
intra-annotation  fraction  of  this  data  set  manifested  itself  as 
clusters  of  protein  interactions  among  proteins  with  the  same 
biological  function,  effectively  defining,  as  well  as  identifying, 
clustered  modules.  The  bottom  row  of  Fig.  3  shows  the  highly 
clustered  annotations  for  the  “Function,”  “Location,”  and 
“Complex” annotation classes from Table I for the high-confidence
AP/MS data set (commensurate with a 5% false positive
rate). The modularity of the organization of this protein
interaction  network  was  evident  by  the  concomitant  clustering 
of  functional  properties  (same  color)  to  distinct  sets  of  pro¬ 


teins.  The  interactions  colored  according  to  their  “Complex” 
annotations  are  explored  in  more  detail  in  the  “MIPS  Complex 
Annotation  of  Interaction  Networks”  Section. 
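
The single-label coloring described in the Fig. 3 caption (one annotation per protein, chosen to maximize the number of homogeneous interactions) can be illustrated with a minimal sketch. The acceptance rule used here (keep a random relabeling whenever it does not reduce the count) is an assumption for illustration, not the procedure from Methods, and the names `count_homogeneous` and `assign_labels` are hypothetical.

```python
import random

def count_homogeneous(edges, label):
    """Number of interactions whose two proteins carry the same single label."""
    return sum(1 for a, b in edges if label[a] == label[b])

def assign_labels(edges, candidates, steps=10000, seed=0):
    """Pick one label per protein so that many interactions become homogeneous.

    edges      : list of (protein, protein) pairs
    candidates : dict protein -> list of allowed annotation labels
    """
    rng = random.Random(seed)
    label = {p: rng.choice(opts) for p, opts in candidates.items()}
    score = count_homogeneous(edges, label)
    # Only proteins with more than one candidate label can be changed.
    multi = [p for p, opts in candidates.items() if len(opts) > 1]
    for _ in range(steps if multi else 0):
        p = rng.choice(multi)
        old = label[p]
        label[p] = rng.choice(candidates[p])
        new_score = count_homogeneous(edges, label)
        if new_score >= score:   # assumed rule: accept moves that lose nothing
            score = new_score
        else:                    # otherwise undo the move
            label[p] = old
    return label, score
```

Because this rule never accepts a worse state, it can stall in a local optimum on larger networks; a temperature-based acceptance step would be the usual remedy.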

Protein  Complex  Size  Dependence— Proteins  often  assem¬ 
ble  into  larger  functional  multiprotein  complexes  that  strongly 
determine  how  proteins  interact  and  arrange  themselves. 
Herein, we demonstrated the systematic biases associated
with the detection techniques, as manifested in how the
number of interacting proteins retrieved from a
given MIPS complex depends on the size of that
complex. Fig. 4A shows the relative fraction of proteins from
intracomplex  interactions  detected  for  each  high-confidence 
data  set  as  a  function  of  the  size  of  the  complex.  Although 
there  was  some  scatter  in  the  data,  it  was  clear  that  both  the 
PCA and Y2H data were enriched with proteins from smaller
complexes (fewer than 25 members) compared with what would be
Fig. 4. Complex size and intra- and interfunction detection.

A,  The  relative  distribution  of  sizes  of  Munich  Information  Center  for 
Protein Sequences complexes in the Saccharomyces cerevisiae proteome
is shown by the gray line labeled “Expected.” The y axis shows
the  fraction  of  proteins  in  a  given  data  set  that  is  associated  with  a 
specific  complex  size.  We  have  indicated  the  corresponding  distribu¬ 
tions  for  proteins  found  in  the  complexes  of  each  high-confidence 
data  set  by  different  symbols.  Whereas  both  the  PCA  and  Y2H  data 
sets  lacked  representations  among  the  large-sized  clusters,  the 
AP/MS  data  set  roughly  followed  the  “Expected”  distribution  derived 
from  all  yeast  proteins.  B,  The  influence  of  the  false  positive  rate  on 
the  fraction  of  intra-  and  interfunction  protein  interactions  detected  in 
the  AP/MS  data  set.  For  reference,  we  have  indicated  by  arrows  the 
PCA  and  Y2H  values  for  intra-  and  interfunction  annotation  fractions 
from  Table  I  on  the  y  axis. 

expected  from  an  analysis  of  all  MIPS  yeast  proteins.  Both  the 
PCA  and  Y2H  data  lacked  interaction  data  from  larger-sized 
complexes.  This  is  in  contrast  to  the  AP/MS  data,  which 
closely  followed  the  Expected  distribution  present  among  all 
annotated yeast proteins, even for the larger-size MIPS clusters.
Although the general view is that AP/MS detects co-complexes
and Y2H detects binary interactions, this is an
oversimplification. The high-confidence AP/MS data contain
a wealth of known binary interactions and, conversely, as
shown here, the Y2H data capture many binary interactions
among MIPS complexes with fewer than 25 members.
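
The comparison in Fig. 4A can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: complexes are binned by size with a hypothetical `bin_width`, a protein is counted once per complex it belongs to, and the function name `size_distribution` is invented here. Called without interactions it gives an "Expected"-style distribution over all annotated members; called with a data set it counts only proteins seen in an intracomplex interaction.

```python
from collections import Counter

def size_distribution(complexes, interactions=None, bin_width=5):
    """Fraction of complex-member proteins falling in each complex-size bin.

    complexes    : dict complex name -> set of member proteins
    interactions : optional list of (protein, protein) pairs; when given, only
                   proteins seen in an intracomplex interaction are counted
    """
    counts = Counter()
    for members in complexes.values():
        if interactions is None:
            detected = set(members)          # all annotated members
        else:
            detected = set()
            for a, b in interactions:
                if a in members and b in members and a != b:
                    detected.update((a, b))
        # Sizes 1-5 map to bin 0, 6-10 to bin 1, and so on for bin_width=5.
        size_bin = (len(members) - 1) // bin_width
        counts[size_bin] += len(detected)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {b: n / total for b, n in sorted(counts.items())}
```

A data set that under-samples large complexes, as described for PCA and Y2H, would show smaller fractions in the high bins than the distribution obtained without interactions.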

Apart  from  experimental  constraints  in  the  PCA  methodol¬ 
ogy,  which  limits  interaction  detection  to  within  an  8-nm  res¬ 
olution,  protein  crowding  affects  interaction  detection  in  both 
PCA  and  Y2H.  Both  of  these  detection  methods  explicitly  rely 
on  the  reconstitution  of  the  biological  activity  of  a  particular 
protein  (transcription  factor  GAL4  for  Y2H  and  the  DFR1 
enzyme  for  PCA).  Even  if  two  investigated  proteins,  as  com¬ 
ponents  of  a  large  native  cellular  complex,  can  interact  and 
constitute  a  functional  unit,  other  components  may  bind  to 
this  unit  disturbing  the  activity  reconstitution  and  conse¬ 
quently  preventing  detection  (8,  11).  The  probability  of  such 
disturbance  is  roughly  proportional  to  the  number  of  compo¬ 


nents  in  the  complex,  i.e.  the  complex  size,  effectively  pre¬ 
venting  the  detection  of  interactions  within  large  complexes 
using  Y2H  and  PCA  methods.  The  probability  of  this  happen¬ 
ing  would  be  larger  at  the  complex’s  native  cellular  location 
because of higher abundances of other complex components.
For example, this may partially explain why we did not observe
any of the interactions between RNA polymerase II components
in the Y2H data set (11). Methodological biases in
detecting  interactions  among  specific  RNA  synthesis  com¬ 
plexes  are  characterized  in  more  detail  in  the  “MIPS  Complex 
Annotation  of  Interaction  Networks”  Section.  The  AP/MS  data 
sets  strictly  do  not  report  an  interaction  per  se,  but  rather  a 
colocalization  of  proteins  in  bait-prey  affinity  purifications. 
Although  these  interactions  should  be  less  sensitive  to  the 
location  of  the  tagged  baits,  the  influence  of  tagging  cannot 
be  ignored  because  it  has  been  previously  demonstrated  that 
intraspecies  overlap  between  AP/MS  data  sets  from  different 
laboratories  remains  low  (38).  Therefore,  whereas  the  high- 
confidence  subsets  of  the  PCA  and  Y2H  data  sets  can  eval¬ 
uate  whether  a  protein  pair  interacts,  the  AP/MS  experimental 
data  can  also  characterize  nonbinary  protein  interactions.  This 
analysis  confirms  that  AP/MS  data  may  be  more  suitable  for 
understanding  interactions  and  biological  relationships  re¬ 
lated  to  structural  complexes  and  larger  molecular  machines, 
whereas  the  Y2H  and  PCA  data  sets  capture  many  binary 
interactions  of  smaller-size  complexes. 

False  Positive  Rate  of  AP/MS  Interactions— The  question 
whether  one  data  set  or  experimental  technique  is  more  ac¬ 
curate  than  another  cannot  be  directly  addressed  by  the 
analyses  presented  herein.  However,  one  advantage  of  our 
derivation  of  high-confidence  interactions  from  the  raw 
AP/MS  data  is  that  we  can  select  data  at  a  given  false  positive 
rate  (See  Methods),  e.g.  as  shown  in  Fig.  3.  As  a  further 
analysis  of  the  data  in  Table  I,  we  calculated  the  fraction  of 
interacting  protein  pairs  that  contain  only  inter-  or  only  intra¬ 
functional  annotations  for  the  AP/MS  data  set  as  a  function  of 
the  false  positive  rate.  Fig.  4B  shows  the  decrease  (increase) 
of  the  intra  (inter)-function  fraction  with  an  increasing  false 
positive  rate,  i.e.  at  a  lower  accuracy.  The  limiting  behavior  at 
the  highest  accuracy  (lowest  false  positive  rate)  approaches 
the  fractions  found  in  the  manually  curated  data  set  of  high- 
confidence  interactions  (BGS).  Intrafunction  annotation  frac¬ 
tions  for  the  AP/MS  data  set  remained  consistently  higher 
than  that  for  either  the  PCA  or  the  Y2H  data  set,  whereas  the 
inter-function annotation fractions leveled out at ~30% at
higher  false  positive  rates,  similar  to  the  PCA  and  Y2H  data 
sets.  However,  one  should  note  that  the  fractions  of  intra-  and 
interfunction  interactions  are  not  indicative  of  the  false  posi¬ 
tive  rates  in  the  PCA  and  Y2H  data  sets  themselves.  Although 
the  construction  of  the  high-confidence  AP/MS  data  set  did 
not  discriminate  between  inter-  or  intrafunction  protein-pro¬ 
tein  interactions,  there  was  a  systematic  enrichment  of 
intrafunction  interactions  at  the  low  false  positive  rates.  As 
previously  reported  (14),  the  binary  data  contained  in  our 
high-confidence AP/MS data comprise interactions of the
same  quality  as  the  manually  curated  data  set  (BGS).  Increas¬ 
ing  confidence  (lowering  the  false  positive  rate)  in  the  AP/MS 
data  increased  the  “binary”  flavor  of  the  interactions.  This  was 
reflected  both  in  the  closer  concordance  with  the  intra-  and 
interfunction  fractions  (Fig.  4B)  and  in  the  fraction  of  proteins 
retained  for  complexes  with  less  than  25  members  (data  not 
shown). 
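
The intra- and interfunction fractions tracked in Fig. 4B can be computed along the following lines. This is a hedged sketch: the definitions used here (a pair is "intra" when both proteins have identical annotation sets, "inter" when they share none) are assumptions standing in for the Table I definitions, and the per-interaction false positive rate estimate is taken as given input. The function names are hypothetical.

```python
def annotation_fractions(interactions, annotations):
    """Return (intra, inter) fractions among pairs where both proteins are annotated.

    Pairs with partially overlapping annotation sets count toward the
    denominator only (an assumption of this sketch).
    """
    intra = inter = total = 0
    for a, b in interactions:
        fa, fb = annotations.get(a), annotations.get(b)
        if not fa or not fb:
            continue                 # skip pairs lacking an annotation
        total += 1
        if fa == fb:
            intra += 1
        elif not (fa & fb):
            inter += 1
    if total == 0:
        return 0.0, 0.0
    return intra / total, inter / total

def fractions_at_cutoff(scored_interactions, annotations, max_fpr):
    """Apply annotation_fractions to interactions with estimated FPR <= max_fpr."""
    kept = [(a, b) for a, b, fpr in scored_interactions if fpr <= max_fpr]
    return annotation_fractions(kept, annotations)
```

Evaluating `fractions_at_cutoff` over a range of `max_fpr` values would trace curves of the kind described for the AP/MS data set.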

Biological  Properties  Associated  with  Interacting  Proteins 

We  observed  that  the  high-confidence  high-throughput  in¬ 
teraction  data  sets  contained  strong  remnants  and  signatures 
reflecting  their  experimental  creation  with  respect  to  func¬ 
tional  characteristics  of  the  proteins  retrieved  and  in  the  size 
of  the  underlying  protein  complex  from  which  the  interactions 
were  detected.  We  now  turn  our  attention  to  describing  how 
the  different  interaction  data  sets  reconstitute  important  func¬ 
tional  components  of  the  cell  and  their  interactions,  how  the 
different  methods  capture  the  cellular  abundance  between 
interacting  proteins,  and  how  methodological  biases  influence 
the  connectivity  of  essential  proteins  in  the  reconstructed 
interaction  networks. 

MIPS  Complex  Annotation  of  Interaction  Networks— The 
objective  of  detecting  and  cataloging  a  genome-wide  range  of 
protein-protein  interactions  present  in  a  cell  is  to  understand 
how  these  interactions  mediate  biological  processes.  The 
premise  of  creating  interaction  networks  is  that  biological 
events  can  be  partially  deconstructed  and  understood  by  sets 
of  direct  protein  interactions.  Because  cellular  processes  are 
often  performed  and  mediated  by  protein  complexes  and 
assemblies,  it  is  expected  that  investigating  the  overall  orga¬ 
nization  of  protein  interaction  networks  reconstituted  from 
interaction  data  should  both  recover  known  associations  and 
provide  additional  insights  into  these  processes.  Conse¬ 
quently,  we  mapped  known  MIPS  protein  complexes  (39) 
onto  each  reconstituted  high-confidence  protein  interaction 
network  to  explore  the  biological  implication  of  the  network 
organization. An interactive map of the annotated high-confidence
networks for use with the chemical structure viewer
Jmol  (http://www.jmol.org/)  is  provided  to  facilitate  further 
explorations  (see  Live  Cellular  Machinery  Map  in  the  Supple¬ 
mentary  Information).  The  networks  exhibited  different  de¬ 
grees  of  patterns  of  connections  between  sets  of  proteins, 
providing  a  wealth  of  information  on  interactions  between 
complexes  and  possible  protein  regulation  of  their  biological 
function. 

Verification  of  known  associations  between  functional  mod¬ 
ules  serves  to  validate  the  inherent  biological  content  of  high- 
confidence  high-throughput  networks  (4-11).  Given  the  high 
degree  of  MIPS  annotated  intrafunction  interactions  in  the 
AP/MS  data  set,  this  reconstructed  network  provided  well- 
separated  and  distinct  modules  in  comparison  to  the  PCA  and 
Y2H  data  sets.  Fig.  5  shows  that  the  known  protein  com- 
Fig. 5. The “Complex” annotated protein interaction map.

Known  complexes  from  the  Munich  Information  Center  for  Protein 
Sequences (MIPS) database and their biological relations were apparent
in the reconstructed protein-protein interaction network derived
from the high-confidence AP/MS data set. We colored components
of the same MIPS complex identically, with gray nodes
representing  unannotated  proteins.  This  reconstructed  network  cap¬ 
tured  a  global  organization  of  protein  complexes,  in  particular  for 
protein  assemblies  related  to  RNA  synthesis,  chromatin  remodeling, 
DNA  replication  and  repair,  protein  synthesis,  and  protein  and  RNA 
degradation  (proteasome  and  exosome).  Proteins  in  transport  com¬ 
plexes  were  connected  among  themselves,  but  the  proteins  that  they 
transport  were  not  captured  in  this  mapping.  Some  labels  have  been 
removed  for  clarity,  and  the  complete  annotations  are  provided  in  the 
Supplementary  Material. 


plexes  from  MIPS  (with  high-throughput  complexes  excluded, 
see  Methods)  were  recovered  well  in  the  AP/MS  data  set.  The 
preponderance of annotated intra- versus inter-complex interactions
accentuated the modular nature of the network,
with  many  complexes  of  known  biological  connections  linked 
to  each  other  via  direct  protein  interactions.  For  example,  the 
two  20S  and  19/22S  regulator  units  of  the  proteasome  were 
directly  linked  to  each  other.  The  central  and  dominant  part  of 
the  connected  network  consisted  mainly  of  protein  com¬ 
plexes  responsible  for  RNA  synthesis  and  other  RNA-related 
processes,  protein  synthesis,  protein  degradation,  and  DNA 
replication  and  repair.  As  transport  proteins  were  under-sam¬ 
pled  in  the  AP/MS  data  set  (Fig.  1),  many  of  the  transport 
systems  (e.g.  Transport  Protein  Particle,  Golgi  Transport,  GIM 
complexes,  t-SNAREs,  v-SNAREs,  AP-2,  AP-3,  etc.)  were  not 
connected  to  the  central  network  and  were  represented  as 
isolated  units  on  the  periphery  of  this  map. 

A  global  comparison  between  individual  interactions  and 
functional  complex  interactions  at  the  protein  level  between 
the  three  different  high-confidence  protein  interaction  data 

[Fig. 6 panel keys; the number of proteins in each category is given in square brackets. Panel label: AP/MS.]
A: 1 RNA polymerase I [14]; 2 RNA polymerase II [13]; 3 RNA polymerase III [13]; 4 TFIIF [3]; 5 TFIIIC [5]; 6 Chromatin structure remodeling (RSC) [10]; 7 SWI/SNF [10]; 8 INO80 [12]; 9 Nucleosomal protein complex [6]; 10 TAFIIs [12]; 11 SAGA [16]; 12 SAGA-like [5]; 13 NuA4 [2]; 14 Histone deacetylase complex [29]; 15 RNA polymerase II mediator (SRB) [21].
B: 1 Cytoplasm [2839]; 2 Nucleus [2140]; 3 Mitochondria [1043]; 4 ER location [537]; 5 Golgi location [128]; 6 Golgi complex [8]; 7 V-ATPase [15]; 8 TIM [2]; 9 TOM [9]; 10 Transport protein particle [10]; 11 GIM [5]; 12 t-SNAREs [8]; 13 v-SNAREs [8]; 14 AP-2 [4]; 15 AP-3 [4]; 16 ERV25 complex [4].

Fig.  6.  Connectivity  among  and  between  Munich  Information  Center  for  Protein  Sequences  complexes.  A,  The  number  of  interactions 
between  proteins  among  the  constituent  functional  complexes  associated  with  RNA  synthesis  (rows  and  columns  1-5),  Chromatin  remodeling 
(rows  and  columns  6-9),  and  Other  DNA  interactions  (rows  and  columns  10-15)  are  color-coded  from  none  (dark)  to  three  or  more  (bright  red). 
The  number  of  proteins  in  each  of  the  15  complexes  is  given  in  square  brackets  below  the  graph.  Abbreviations:  TFIIF,  Transcription  factor 
complexes  II  F;  TFIIIC,  Transcription  factor  complexes  III  C;  SWI/SNF,  SWItch/Sucrose  NonFermentable;  TAFIIs,  TATA-binding  protein 
associated  factors;  SAGA,  Spt/Ada/GCN5/acetyltransferase;  NuA4,  nucleosome  acetyltransferase  of  H4.  B,  The  link  between  the  cellular 
locations  of  proteins  and  the  different  protein  transport  assemblies  that  interact  with  these  proteins.  The  figure  shows  the  connection  Z-score 
(See  Methods)  associated  with  co-occurring  protein  labels  of  interacting  proteins  as  a  function  of  their  locations  (rows  and  columns  1-5)  and 
association  with  different  transporter  complexes  (rows  and  columns  6-16).  Abbreviations:  ER,  Endoplasmic  reticulum;  TIM,  the  inner 
mitochondrial  membrane  protein  translocase;  TOM,  transport  across  the  outer  membrane;  GIM,  prefoldin  protein  complex;  t-SNARE,  target 
SNAP  (Soluble  NSF  Attachment  Protein)  Receptor;  v-SNARE,  vesicle  SNAP  (Soluble  NSF  Attachment  Protein)  Receptor;  AP-2,  Adaptor  protein 
complex-2;  AP-3,  Adaptor  protein  complex-3;  ERV25,  ER  Vesicle  25. 


sets  is  not  productive  because  of  the  almost  negligible  over¬ 
lap  of  interactions.  Instead,  we  have  focused  on  a  comparison 
of interactions between select functional complexes associated
with “RNA synthesis related complexes” and a separate
comparison of the “Protein transport machineries” to highlight
the inherent methodological biases of the three high-throughput
techniques.

RNA  Synthesis  Related  Complexes— RNA  synthesis  is 
closely  connected  to  chromatin  remodeling  and  other  biolog¬ 
ical  processes  that  control  DNA.  Fig.  5  shows  that  RNA  syn¬ 
thesis  complexes  formed  a  highly  interconnected  cluster,  in¬ 
cluding  RNA  polymerases  I,  II,  and  III,  Transcription  factor 
complexes  II  F  (TFIIF)  and  III  C  (TFIIIC),  which  were  connected 
via  direct  protein-protein  interactions  with  many  other  func¬ 


tional  complexes.  Fig.  6A  shows  a  comparison  of  the  number 
of  detected  intra-  and  intercomplex  protein-protein  interac¬ 
tions  among  and  between  selected  MIPS  annotated  com¬ 
plexes.  In  particular,  Fig.  6A  highlights  the  intra-  and  intercon¬ 
nectivity  among  RNA  synthesis,  chromatin  remodeling,  and 
other  DNA  interacting  protein  complexes  for  the  three  high- 
confidence  interaction  data  sets.  Consistent  with  the  higher 
intra-annotation  fraction  of  the  AP/MS  data  (Table  I  and  Fig. 
2),  the  diagonal  elements  for  this  data  set  were  the  most 
populated.  Therefore,  the  AP/MS  data  show  that  these  three 
biological  processes  are  highly  intraconnected  via  protein- 
protein  interactions;  the  PCA  data  capture  part  of  these  inter¬ 
actions  among  RNA  synthesis  complexes  and  the  Y2H  data 
capture protein interactions mainly among the DNA-associated
protein complexes, such as TATA-binding protein associated
factors (TAFIIs) and SAGA complexes. This is also
consistent with the observation that Y2H methods preferentially
detect interactions among proteins that contain specific
DNA-binding motifs (40).

The off-diagonal elements of Fig. 6A address protein interaction
links between different functional complexes and draw
attention  to  the  different  types  of  interactions  retrieved  by  the 
three  high-throughput  methods.  Chromatin  remodeling  is  re¬ 
quired  to  initiate  and  conduct  RNA  synthesis  and  DNA  repli¬ 
cation  and  repair.  Fig.  5  shows  the  overall  location  of  chro¬ 
matin  remodeling  and  RNA  polymerase  proteins  of  the  AP/MS 
data set, whereas Fig. 6A shows the detailed complex-complex
interaction mapping for the three high-confidence networks.
The interaction data support the connection of RNA polymerases II
and III with chromatin remodeling modules, via
RNA polymerase II interacting with the nucleosome remodeling
complexes SWI/SNF and INO80 and via RNA polymerase III
interacting with the chromatin structure remodeling
(RSC) complex and nucleosomal protein complexes (histones)
(41, 42). DNA interacting assemblies, such as TAFIIs
and the RNA polymerase II mediator complex (SRB), were also
directly linked via protein-protein interactions to RNA synthesis
complexes, whereas others, such as the histone acetyltransferase
complexes [SAGA (43) and NuA4 (44)] and the histone
deacetylase complex, were indirectly linked (43-46). As shown
in Fig. 1, the AP/MS data set was relatively enriched in proteins
and protein interactions associated with cell cycle processes
compared with the PCA or Y2H data. Hence, the
organization  of  the  AP/MS  interaction  network  captured  the 
biological  association  of  the  functional  components  of  the  cell 
cycle  via  the  assembly  of  direct  protein-protein  interactions  to 
a  higher  degree  than  the  PCA  and  Y2H  networks. 

Protein  Transport  Machineries—  Fig.  5  shows  the  relative 
isolation of transport complexes in the AP/MS network reconstruction.
Fig. 6B shows an overall comparison between detected
protein interactions within transport complexes in the
three  high-confidence  data  sets  as  well  as  the  interaction 
tendency  between  the  transporters  and  other  proteins  in  dif¬ 
ferent  cellular  compartments.  Herein,  we  reported  the  co¬ 
occurrences  of  MIPS  labels  (“Locations”  and  “Transporters”) 
between  interacting  proteins  based  on  Z-score  calculations 
(See  Methods),  where  the  green  color  indicates  avoidance 
and  the  red  color  indicates  preferential  co-occurrence.  This 
score  measures  the  likelihood  that  proteins  in  different  anno¬ 
tation  categories  are  associated  and  interact  with  each  other. 
For  the  transporters  themselves,  the  AP/MS  data  recovered 
the  intracomplex  interactions  to  a  higher  degree  than  in  the 
corresponding  PCA  and  Y2H  data  sets.  Conversely,  interac¬ 
tions  between  these  transport  complexes  and  proteins  lo¬ 
cated  in  the  cytoplasm,  nucleus,  mitochondria,  endoplasmic 
reticulum  (ER),  and  Golgi  were  strongly  avoided  in  the  AP/MS 
data  set.  The  PCA  and  Y2H  data  sets  did  not  show  this 
avoidance;  instead,  many  more  interactions  between  trans¬ 


porter  complexes  and  other  cellular  proteins  were  present  in 
these  data  sets.  As  a  specific  example,  members  of  the  p24 
family  are  engaged  in  protein  transport  between  the  ER  and 
the  Golgi  apparatus.  Protein  transport  is  executed  via  COPII- 
coated  vesicles,  where  four  of  the  p24  member  proteins 
(ERV25,  EMP24,  ERP1,  and  ERP2)  are  known  to  form  the 
ERV25 complex that lines the vesicles and interacts with cytoplasmic
coat proteins (47). Annotated as a cellular transport
protein,  ERV25  was  associated  with  43  interactions  in  the 
PCA  data  set,  but  with  only  five  interactions  in  the  AP/MS  data 
set. Fig. 6B also captured the gross features of this interaction
imbalance  with  a  virtual  absence  of  interactions  in  the  AP/MS 
data  between  the  ERV25-complex  and  proteins  in  the  five 
locations  shown,  compared  with  an  abundance  of  interac¬ 
tions  in  the  PCA  data.  Whereas  none  of  these  interactions 
were  present  in  the  Y2H  data  set,  only  one  protein  (ERP1) 
among  these  43  PCA  proteins  coincided  with  the  AP/MS  set, 
suggesting  that  a  large  portion  of  the  detected  PCA  interac¬ 
tions  were  weak  or  transient  physical  associations,  such  as 
those  between  the  transporters  and  the  protein  objects  that 
they  transport. 
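
A co-occurrence Z-score of the kind shown in Fig. 6B can be sketched as below. The exact calculation is defined in Methods and is not reproduced here; this illustration assumes a null model in which the label sets are shuffled among the proteins of a fixed network, and the names `cooccurrence` and `cooccurrence_zscore` are hypothetical. Positive values indicate preferential co-occurrence of two labels across interactions, and negative values indicate avoidance.

```python
import random
import statistics

def cooccurrence(edges, labels, x, y):
    """Count interactions that join a protein labeled x to a protein labeled y."""
    n = 0
    for a, b in edges:
        la, lb = labels.get(a, set()), labels.get(b, set())
        if (x in la and y in lb) or (y in la and x in lb):
            n += 1
    return n

def cooccurrence_zscore(edges, labels, x, y, n_random=1000, seed=0):
    """Z-score of the observed x-y co-occurrence against label-shuffled networks."""
    rng = random.Random(seed)
    observed = cooccurrence(edges, labels, x, y)
    proteins = sorted({p for e in edges for p in e})
    label_sets = [labels.get(p, set()) for p in proteins]
    null = []
    for _ in range(n_random):
        shuffled = label_sets[:]
        rng.shuffle(shuffled)        # reassign label sets to proteins at random
        null.append(cooccurrence(edges, dict(zip(proteins, shuffled)), x, y))
    sd = statistics.pstdev(null)
    if sd == 0:
        return 0.0                   # no variation in the null: score undefined
    return (observed - statistics.mean(null)) / sd
```

In a network where transporter proteins interact only among themselves, the transporter-location score from this sketch is negative, which corresponds to the avoidance pattern described for the AP/MS data set.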

The  raw-AP/MS  data  set  contains  similar  weak  associations 
between  cellular  machinery  components  and  the  objects  they 
operate  upon.  For  example,  as  a  sub-unit  of  the  peripheral 
membrane  domain  of  the  vacuolar  H+-ATPase  (V-ATPase), 
VMA2  was  associated  with  512  and  six  interactions  in  the  raw 
and  high-confidence  AP/MS  data  sets,  respectively.  The  high 
number  of  interactions  in  the  raw  data  set  was  most  likely  a 
true  reflection  of  the  cellular  function  of  the  protein,  i.e. 
V-ATPase  is  an  enzyme  with  remarkably  diverse  functions  in 
eukaryotic  organisms  and  it  acidifies  a  wide  array  of  intracel¬ 
lular  organelles  by  pumping  protons  across  plasma  mem¬ 
branes.  Although  other  V-ATPase-interactions  were  present, 
none of the VMA2-specific interactions was present in the
high-confidence  PCA  or  the  Y2H  data  sets  (Fig.  6B).  In  gen¬ 
eral,  although  PCA  can  capture  some  of  the  transient  inter¬ 
actions  very  accurately,  the  IDBOS-analysis  of  the  AP/MS 
purification  data  cannot  distinguish  transient  or  true  weak 
interactions  from  low-confidence  promiscuous  associations. 

Although  the  overall  coverage  of  interaction  in  the  high- 
confidence  AP/MS  data  set  remained  small  compared  with 
the  complete  interactome,  we  could  still  recover  a  rough 
outline  of  many  of  the  components  of  the  cellular  machinery 
and  connections  among  and  between  these  components. 
The clusters of proteins mapped out in Fig. 5 defined biologically
distinct components and assemblies that constituted
the bulk of the cellular machinery. These biological
units  were,  in  turn,  connected  with  other  units  and  delin¬ 
eated  a  global  organization  of  the  working  components  of 
the  cell.  Thus,  the  high-confidence  AP/MS  data  set  yielded 
different,  complementary  insights  into  protein  properties, 
protein  assemblies,  and  how  they  are  cross-connected  in 
the  cell  compared  with  the  PCA  and  Y2H  high-confidence 
data  sets. 
Fig. 7. Protein abundance correlation between interacting protein
pairs. A, Spearman rank-correlation coefficients rSpearman between
abundance ranks of interacting proteins. The histograms are
shown for all interactions (All interactions), the subset that encompasses
interactions that share the same intrafunction annotations
(Intrafunction), and all remaining ones that do not (Other). We indicated
correlations that were not statistically significant by a star (*).
B, Abundance map of the AP/MS data set. Abundance values (50)
were divided into 11 classes from 0 (smallest) to 10 (largest), and each
class  was  represented  by  a  color.  Class  0  is  the  collection  of  proteins 
whose  abundances  were  too  small  to  be  detected,  and  classes  1-10 
were  equally  divided  among  proteins  whose  abundance  could  be 
detected.  Interacting  proteins  in  each  visible  cluster  tended  to  have 
the  same  abundance  values. 


Protein  Abundance  of  Interacting  Protein  Pairs—  Previous 
research  has  shown  that  proteins  in  stable  complexes  tend  to 
have  similar  mRNA  expression  profiles  (48,  49).  It  is  natural  to 
assume  that  two  interacting  proteins  are  more  likely  to  be  of 
similar  abundance  than  two  noninteracting  proteins.  This  as¬ 
sumption  may  now  be  directly  tested  with  the  advent  of 
protein  abundance  measurements  such  as  the  recent  mea¬ 
surement  of  yeast  cells  in  rich  media  (50).  For  all  the  proteins 
in  our  study  we  retrieved  the  corresponding  proteins’  cellular 
abundance data from the study of Newman et al. (50). Fig. 7A
shows  the  calculated  Spearman  correlation  coefficients  be¬ 
tween  abundances  of  interacting  proteins  for  the  five  data 
sets  in  Table  I.  In  order  to  verify  that  these  correlations  were 
not  spuriously  related  to  the  observed  interactions,  we  cal¬ 
culated  the  correlation  coefficients  using  1000  randomly 
rewired  versions  of  each  protein  interaction  data  set  and 
detected  no  significant  correlations.  Using  the  same  cellular 
abundance  data,  we  also  “colored”  proteins  in  the  high- 
confidence  AP/MS  derived  network  from  Fig.  3  according  to 


their abundance. Fig. 7B shows the corresponding abundance
map, which demonstrates that interactions were
much  more  likely  to  occur  between  proteins  of  similar 
abundance. 
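
The abundance correlation and its rewiring control can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in this work: each interacting pair is entered in both orientations so that the coefficient does not depend on edge direction, and "rewiring" is approximated by shuffling one endpoint of every interaction, which preserves each protein's number of interactions but may create self-pairs or duplicates. All function names are hypothetical.

```python
import random

def ranks(values):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def spearman(x, y):
    """Spearman rank correlation computed as Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))

def pair_correlation(edges, abundance):
    """Spearman correlation between the abundances of interaction partners."""
    pairs = [(a, b) for a, b in edges if a in abundance and b in abundance]
    if not pairs:
        return 0.0
    # Enter each pair in both orientations (assumption of this sketch).
    x = [abundance[a] for a, b in pairs] + [abundance[b] for a, b in pairs]
    y = [abundance[b] for a, b in pairs] + [abundance[a] for a, b in pairs]
    return spearman(x, y)

def rewired_correlations(edges, abundance, n_random=1000, seed=0):
    """Correlations for networks with one endpoint of every edge shuffled."""
    rng = random.Random(seed)
    left = [a for a, _ in edges]
    right = [b for _, b in edges]
    out = []
    for _ in range(n_random):
        rng.shuffle(right)
        out.append(pair_correlation(list(zip(left, right)), abundance))
    return out
```

Comparing the observed coefficient with the spread of the rewired values indicates whether the correlation reflects the specific pairing of partners rather than the overall abundance distribution.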

The  curated  binary  interaction  data  set  (BGS)  showed  a 
clear  positive  correlation  of  the  protein  abundance  of  the 
interacting pairs, indicating that the assumption that interacting
proteins tend to have similar concentrations
was reasonable. All high-confidence data sets
exhibited  statistically  significant  positive  correlations  to  vary¬ 
ing  degrees.  However,  the  magnitude  of  the  positive  correla¬ 
tions  for  the  high-confidence  AP/MS,  PCA,  and  Y2H  data 
sets, 0.53 (p value < 10^-700), 0.17 (p value < 10^-20), and 0.07
(p value < 10^-5), respectively, varied substantially. Using the
intra-  and  interannotation  schemes  shown  in  Table  I,  we  cal¬ 
culated  the  corresponding  intra-function  correlation  values. 
Fig.  7A  shows  that  the  high-confidence  AP/MS  data  retained 
the  higher  abundance  correlation  for  this  subset  of  interac¬ 
tions,  whereas  the  PCA  and  Y2H  intrafunction  correlation 
values increased to roughly 0.2 (p values < 10^-30). The relatively
lower correlation values of the intrafunction Y2H and
PCA  data  sets  with  respect  to  the  AP/MS  ultimately  depended 
on the functional characteristics of the protein sets retrieved
by  the  different  methodologies.  For  example,  the  enrichment  of 
interactions in the AP/MS data set of highly coordinated functions,
such as cell cycle, transcription, and protein synthesis,
resulted  in  an  enhanced  abundance  correlation.  Likewise,  the 
enrichment  of  interactions  in  the  PCA  data  set  of  transport 
functions,  which  do  not  require  the  expression  of  proteins  at  the 
same  level,  resulted  in  a  decreased  abundance  correlation. 

In  contrast  to  the  high-confidence  data  sets,  we  observed  a 
negative  correlation  in  the  raw  AP/MS  data  set.  This  correla¬ 
tion  is  an  artifact  of  the  experimental  methodology  and  it 
illustrates  how  technology  can  mislead  a  casual  interpretation 
of  the  data.  In  the  raw  affinity  purification  data,  there  is  a 
positive  correlation  between  the  number  of  interactions  in 
which  a  protein  is  engaged  and  the  cellular  abundance  of  the 
proteins  (49),  i.e.  proteins  with  many  interaction  partners 
(hubs)  are  typically  high-abundance  proteins  and  proteins 
with  few  interactions  are  typically  low-abundance  proteins. 
However, because hub proteins are few, the hub proteins
in the raw data interact mostly with the far more numerous nonhub
proteins. Thus, the raw AP/MS data showed an overall inverse
correlation,  i.e.  interactions  between  high-  and  low-abun¬ 
dance  proteins  are  more  likely  to  occur  than  between  similarly 
abundant proteins. Fig. 7A shows that neglecting protein pairs
that  we  inferred  to  have  positive  abundance  correlations  in  the 
raw-AP/MS  data  set,  e.g.  those  occurring  within  the  same 
functions (Intra-function), and evaluating the abundance
correlation in the remaining set (Other, rSpearman = -0.34, p
value < 10^-300) enhanced this effect.

The  application  of  an  intrafunction  constraint  to  select  in¬ 
teracting  proteins  clearly  increased  the  correlation  of  all  these 
data  sets,  but  it  also  highlighted  the  inherent  differences  in  the 
interaction data present in the high-confidence data sets. We
found,  to  a  higher  degree  than  for  the  PCA  and  Y2H  data  sets, 
that  the  high-confidence  AP/MS  data  were  enriched  with 
interactions  whose  proteins  were  present  at  the  same  con¬ 
centrations.  To  further  confirm  the  assumption  that  interacting 
proteins  should  be  present  in  roughly  the  same  concentra¬ 
tions,  we  assessed  whether  interactions  in  the  AP/MS  data 
sets  were  sensitive  to  their  cellular  location.  The  expectation 
was  that  interactions  present  in  the  same  cellular  locations 
(intralocation)  should  have  significantly  higher  correlation  val¬ 
ues  between  proteins  than  those  that  do  not  share  the  same 
cellular  location  (interlocation).  We  calculated  the  correlation 
values  for  interactions  in  the  intra-  and  interlocation  set  for  the 
AP/MS to be 0.55 (p value 10^-729) and 0.10 (p value 0.004),
respectively.  This  was  consistent  with  complexes  isolated  from 
the  AP/MS  purifications  retaining  their  natural  composition, 
commensurate  with  their  native  location,  and  not  contaminated 
with  protein  complexes  from  different  cellular  locations. 
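
The intra- versus interlocation comparison above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in this study: the function names, the toy abundance values, and the choice to count each undirected pair in both orientations are all hypothetical.

```python
from statistics import mean

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            out[k] = average_rank
        pos = end + 1
    return out

def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def abundance_correlation(pairs, abundance, locations):
    """Return (intralocation, interlocation) correlations of partner abundances."""
    intra, inter = [], []
    for a, b in pairs:
        bucket = intra if locations[a] & locations[b] else inter
        # Assumption: each undirected pair is entered in both orientations so
        # that the result does not depend on how a pair happens to be written.
        bucket.append((abundance[a], abundance[b]))
        bucket.append((abundance[b], abundance[a]))
    return tuple(spearman(*zip(*bucket)) for bucket in (intra, inter))

# Hypothetical toy data: abundances in arbitrary units, one location per protein.
abundance = {"A": 100, "B": 120, "C": 900, "D": 1000, "E": 50, "F": 5000}
locations = {"A": {"nucleus"}, "B": {"nucleus"}, "C": {"cytoplasm"},
             "D": {"cytoplasm"}, "E": {"nucleus"}, "F": {"cytoplasm"}}
pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("A", "D")]
intra_r, inter_r = abundance_correlation(pairs, abundance, locations)
```

The sketch needs at least two distinct abundance values in each group; significance testing (the p values quoted in the text) is not included.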

Essentiality  of  Proteins  in  the  Different  Interaction  Data 
Sets— Relating  the  essentiality  of  a  protein  to  its  position  in  a 
protein interaction network has highlighted the uncertain nature
of biological interpretations of data-driven reconstructions
of protein interaction networks. Since the earliest
observation, based on one of the first Y2H sets (10), that
hubs, i.e. proteins with many interactors, are more likely to
be essential (51), other investigations have confirmed a
positive correlation between essentiality and degree, i.e. the
essentiality-connectivity rule (52, 53). However, different
explanations for this rule also emerged: (1) hubs play
a  vital  role  in  maintaining  the  network  connectivity  (54);  (2) 
essentialities  of  proteins  come  from  their  interactions,  and 
hubs  are  more  likely  to  be  involved  in  essential  interactions 
(55);  (3)  rejecting  explanations  1  and  2  above,  Zotenko  and 
colleagues  demonstrated  instead  that  hubs  tend  to  be  essen¬ 
tial  because  of  their  memberships  in  essential  modules  (56),  a 
previously  suggested  concept  (21).  In  line  with  explanation  3, 
another  study  has  shown  that  large  complexes  tend  to  be 
essential (57). However, a recent investigation of the
consolidated Y2H set and the union of two original AP/MS sets
invalidated the essentiality-connectivity rule (11), indicating
the  dependence  of  this  rule  on  the  underlying  interaction 
networks.  Herein,  we  quantified  this  dependence,  highlighted 
the  inherent  differences  among  the  high-confidence  data 
sets,  and  provided  a  consensus  analysis  to  finally  infer  a 
consistent  conclusion  from  all  data  sets. 

Fig.  8A  shows  the  fraction  of  essential  proteins  in  the  set  of 
proteins defined by a specific hub-threshold, where a
hub-threshold of 0.1 means that we selected the top 10% most
highly connected proteins as hubs. The difference in
essentiality-connectivity correlations among all the protein interaction
data  sets  from  Table  I  is  quite  evident.  The  PCA  data  exhibited 
the opposite behavior: the more connected the proteins,
the smaller the fraction of essential proteins. In
contrast, our high-confidence AP/MS data showed a strong
correlation between essentiality and connectivity. In line with
the investigation by Yu et al. (11), we confirmed that
previously constructed high-confidence AP/MS sets (6, 7, 20) did
not have such a strong correlation. Consistent with the
apparently neutral sampling of interactions in the Y2H data set,
the essential fraction was largely independent of the
hub-threshold. The manually curated binary data set and the raw AP/MS
data  exhibited  similar  modest  correlations  at  lower  hub- 
thresholds,  i.e.  among  highly  connected  proteins. 
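
The hub-threshold analysis of Fig. 8A can be sketched roughly as below. This is an illustrative simplification, not the code used here: hubs are selected by rank rather than by a degree cutoff, ties are broken alphabetically, and the function name and toy network are hypothetical.

```python
from collections import Counter

def essential_fraction_among_hubs(interactions, essential, hub_threshold):
    """Fraction of essential proteins among the top `hub_threshold` share of
    proteins ranked by degree (number of interaction partners)."""
    degree = Counter()
    for a, b in interactions:
        degree[a] += 1
        degree[b] += 1
    # Assumption: ties in degree are broken alphabetically for reproducibility.
    ranked = sorted(degree, key=lambda p: (-degree[p], p))
    n_hubs = max(1, round(hub_threshold * len(ranked)))
    hubs = ranked[:n_hubs]
    return sum(p in essential for p in hubs) / n_hubs

# Hypothetical toy network: H is the only highly connected protein.
interactions = [("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"), ("a", "b")]
essential = {"H", "c"}
top_20_percent = essential_fraction_among_hubs(interactions, essential, 0.2)
all_proteins = essential_fraction_among_hubs(interactions, essential, 1.0)
```

Scanning `hub_threshold` from small to large values traces a curve of the kind shown in Fig. 8A: a decreasing curve indicates that the most connected proteins are enriched in essential proteins.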

Given  the  pervasive  belief  that  protein  interaction  data  sets 
are  filled  with  “noise,”  we  investigated  the  consequences  of 
imposing  a  “high-confidence”  filter  in  the  AP/MS  data  set.  Fig. 
8B  shows  the  effect  on  the  essentiality-connectivity  correla¬ 
tion  as  a  function  of  the  false  positive  rate.  Decreasing  the 
confidence  from  the  designated  high-confidence  level  at  a  5% 
false  positive  rate  significantly  weakened  the  essentiality-con¬ 
nectivity  correlation.  At  false  positive  rates  at  or  above  20%, 
the  essentiality-connectivity  correlation  was  similar  to  that  of 
the  raw  data  for  hub-thresholds  less  than  0.1.  To  determine 
whether  the  high  correlation  for  the  high-confidence  AP/MS 
data  set  stemmed  from  the  essentiality  of  large-sized  com¬ 
plexes,  we  used  MIPS  complexes  to  investigate  the  complex- 
size  essentiality  correlation  (data  not  shown).  Contradicting  a 
previous study that used self-defined complexes and
suggested a positive correlation (57), we found no global
correlation between complex size and essentiality; a positive
correlation existed only for small complexes
comprising fewer than 10 proteins.

Instead, we found that the largest influence on the
essentiality-connectivity correlation was the functional
composition of the proteins retrieved in the
different data sets. Fig. 8C shows the essential fraction of all
yeast  proteins  annotated  by  MIPS,  which  indicated  that  pro¬ 
teins  involved  in  “Transcription,”  “Protein  synthesis,”  and 
“Protein  binding”  tended  to  be  more  essential  than  other 
groups. Comparing this figure with the retrieved proteins in the
high-confidence data sets and their interactions in Fig. 1C
makes it clear that the conflicting essentiality-connectivity
correlations of the high-confidence data sets were closely related
to their opposite interaction frequency profiles across functional
categories. For example, as mentioned above, AP/MS sampled
a large fraction of interactions involved in transcription,
whereas PCA sampled far fewer of these interactions. The
negative  essentiality-connectivity  correlation  in  the  PCA  data 
set  in  Fig.  8A  was  a  direct  consequence  of  the  relatively  higher 
number  of  interactions  sampled  in  the  “Cellular  transport” 
category  (Fig.  1C)  and  these  proteins’  relatively  low  fraction  of 
essential  proteins  (Fig.  8C).  To  exclude  the  effect  of  the  inter¬ 
action  sampling  difference,  we  investigated  which  proteins 
were  more  likely  to  be  essential  within  a  given  functional 
category.  Thus,  for  each  data  set,  we  separated  proteins  in 
each MIPS category with degrees above and below the category
average into two groups, and found that the above-average
group (“Higher-connectivity proteins” in Fig. 8D) was
Fig. 8. Distribution of essential proteins. A, The fraction of essential proteins among hub proteins as a function of hub threshold,
represented as the fraction of (hub) proteins with degrees larger than a given degree among all proteins of the studied protein interaction
network. For example, a hub threshold of 0.1 means that we selected the top 10% most highly connected proteins as hubs. B, The essential fraction
of proteins in the AP/MS data is shown for several different false positive rates, indicating the sensitivity of the essentiality-connectivity
correlation to the confidence level of the data. C, The fraction of all yeast proteins that were essential as a function of the Munich Information
Center for Protein Sequences (MIPS) function categories. D, More connected proteins within the same MIPS function category tended to be
essential for all five data sets. For each MIPS complex, we calculated the average degree and extracted proteins of above-average degrees
(Higher-connectivity proteins) into a group and those with below-average degrees (Lower-connectivity proteins) into another group. After
scanning all MIPS complexes, we calculated the fraction of essential proteins in each group. The consensus conclusion indicated that
higher-connectivity proteins were indeed more essential than lower-connectivity ones.


significantly more enriched with essential proteins. This
consensus observation demonstrated that, once sampling biases
were removed, the more highly connected proteins within a
functional category tended to be essential.
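
The within-category comparison can be sketched as follows. This is a minimal illustration under assumptions that the text does not specify: proteins whose degree equals the category average are placed in the lower group, a protein annotated with several categories is counted once per category, and all names and toy values are hypothetical.

```python
def essential_fraction_by_connectivity(degree, categories, essential):
    """Split the proteins of each category at the category's average degree and
    return the essential fractions of the (higher, lower) connectivity groups."""
    higher, lower = [], []
    for members in categories.values():
        present = [p for p in members if p in degree]
        if not present:
            continue
        average = sum(degree[p] for p in present) / len(present)
        for p in present:
            # Assumption: a degree exactly equal to the average counts as lower.
            (higher if degree[p] > average else lower).append(p)

    def fraction(group):
        return sum(p in essential for p in group) / len(group) if group else float("nan")

    return fraction(higher), fraction(lower)

# Hypothetical toy data for two functional categories.
degree = {"A": 5, "B": 1, "C": 1, "D": 4, "E": 2}
categories = {"transcription": ["A", "B", "C"], "transport": ["D", "E"]}
essential = {"A", "B", "D"}
higher_fraction, lower_fraction = essential_fraction_by_connectivity(
    degree, categories, essential)
```

Because each protein is compared only with the average of its own category, differences in how often a method samples a given category do not enter the comparison.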

CONCLUSIONS 

Recent  advances  in  high-throughput  experimental  tech¬ 
niques  have  identified  large  numbers  of  possible  protein- 
protein  interactions.  Although  the  protein  sets  tested  for  in¬ 
teractions  overlap,  the  actual  detected  protein  interactions 
among  these  data  sets  do  not  overlap  to  any  significant 
degree.  Consequently,  extracting  and  interpreting  biological 
information  from  networks  reconstructed  from  such  data  de¬ 
pend  on  the  underlying  detection  methodology.  Using  pro¬ 
tein-protein  interaction  networks  in  systems  biology  ap¬ 
proaches  to  study  phenotypic  behavior  requires  that  the 
networks  contain  the  relevant  proteins  responsible  for  the 
underlying biological processes (17, 58). Herein, our
systematic classification of interactions according to their functional
characteristics and each technology’s ability to detect these
interactions shed light on the underlying biases of the


content  of  these  data  sets.  The  comparison  between  three 
different  high-confidence  high-throughput  data  sets  derived 
from  different  methodologies  provided  a  quantitative  measure 
of  the  functional  biases  of  the  retrieved  interactions  and  pro¬ 
teins.  This  confirmed  the  unbiased  nature  of  the  Y2H  data, 
preferential  retrieval  of  co-complex  interactions  from  AP/MS, 
and  the  ability  of  the  PCA  method  to  detect  many  transient 
interactions. However, we also noted the large number of
binary interactions present in the high-confidence AP/MS data and
the presence of many binary interactions from complexes that
were retrieved in the Y2H and PCA data sets. The
high-confidence protein interaction data sets were associated with
biologically  conflicting  properties,  such  as  protein  abundance 
and essentiality, biased by the underlying detection
methodology. These biases determined, to a large extent, the biological
insights derived from each data set and were more influential
than the topological properties of the networks themselves.
Although  these  biases  could  be  removed  for  certain  analyses, 
e.g.  consistently  relating  protein  connectivity  with  gene  es¬ 
sentiality,  much  work  remains  in  defining  the  properties  and 
the range of suitable applications of the experimentally
determined, partially complete interactome in systems biology
studies. 

A  proper  and  consistent  interpretation  of  biological  proper¬ 
ties  derived  from  high-throughput  protein  interaction  data 
sets  requires  careful  consideration  of  potential  methodologi¬ 
cal  biases  incorporated  in  the  data.  A  facile  way  to  quantify 
this  is  to  compare  the  functional  characteristics  of  the  con¬ 
stituent  proteins  by  comparing  the  relative  distribution  of 
these  proteins  among  functional  categories.  Although  analy¬ 
ses  of  data  with  the  same  protein  characteristics  should  yield 
consistent insights, apparently contradictory results derived
from  different  data  sets  may  tell  us  more  about  the  underlying 
experimental  biases  themselves  than  the  underlying  biology. 

MATERIALS  AND  METHODS 

Protein Interaction Data Sets—We investigated the following
protein interaction data sets, comprising three
high-confidence high-throughput data sets derived from AP/MS, PCA,
and  Y2H  experiments,  an  unfiltered,  raw  interaction  data  set 
(denoted  as  raw-AP/MS),  and  a  manually  curated  set  focusing 
on  binary  protein  interactions  (BGS).  Columns  1-3  of  Table  I 
summarize  the  number  of  proteins  and  interactions  contained 
in  the  following  five  data  sets: 

AP/MS— We  have  previously  developed  and  applied  the 
IDBOS  method  to  extract  and  analyze  high-confidence  pro¬ 
tein  interaction  networks  from  each  of  three  individual  AP/MS 
data  sets  (14).  It  was  determined  that  the  data  of  Gavin  et  al. 
(6)  showed  the  highest  specificity  of  protein  associations  and 
no  abundance  bias,  i.e.  high-abundance  proteins  tend  not  to 
have  more  interactions,  and  for  these  reasons  we  opted  to 
study this data set as the best representation of
high-confidence AP/MS data. This data set is also included in the
supplemental  material  as  an  annotated,  downloadable  file. 

PCA— The recent PCA strategy detects in vivo protein
interactions via fusions to enzyme fragments that, when
reconstituted, restore catalytic activity and, consequently, cell
growth  (59).  This  methodology  does  not  depend  upon  the 
expression  of  a  reporter  protein  as  required  in  Y2H  screens. 
The  PCA  technique  was  applied  on  a  genome-wide  scale  for 
yeast  and  yielded  many  new,  previously  undiscovered  protein 
interactions  (8).  Although  we  are  aware  of  only  one  large-scale 
interaction  data  set  determined  using  the  PCA  technique,  the 
reported  98%  positive  predictive  value  of  these  interactions 
defines  them  as  a  high-confidence  data  set. 

Y2H— The  Y2H  method  was  the  first  high-throughput  tech¬ 
nology  to  assess  protein  interactions  on  a  genomic  scale,  and 
Yu  et  al.  have  consolidated  three  large-scale  Y2H  data  sets 
(10,  11,  60)  into  one  high-quality  representative  S.  cerevisiae 
data set (11). This data set was representative of
high-confidence Y2H interactions.

Raw-AP/MS— To  illustrate  the  importance  of  using  a  high- 
confidence  network,  we  also  analyzed  the  raw  interaction 
data of Gavin et al. (6), where we used the spoke model to
construct  a  corresponding  protein  interaction  set.  In  this 


model,  we  retained  the  bait-prey  pairs  from  the  purification 
and  did  not  include  all  possible  prey-prey  pairs. 
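
The difference between the spoke model used here and a matrix enumeration of all co-purified pairs can be illustrated with a short sketch. The representation of a purification as a (bait, preys) tuple and the function names are hypothetical choices made for this example.

```python
from itertools import combinations

def spoke_pairs(purifications):
    """Spoke model: keep only bait-prey pairs from each purification."""
    pairs = set()
    for bait, preys in purifications:
        for prey in preys:
            if prey != bait:
                pairs.add(frozenset((bait, prey)))
    return pairs

def matrix_pairs(purifications):
    """Matrix model: keep every pair of proteins found in the same purification,
    including prey-prey pairs."""
    pairs = set()
    for bait, preys in purifications:
        members = sorted(set(preys) | {bait})
        pairs.update(frozenset(pair) for pair in combinations(members, 2))
    return pairs

# Hypothetical toy data: two purifications given as (bait, preys).
purifications = [("A", ["B", "C"]), ("B", ["A", "D"])]
spoke = spoke_pairs(purifications)
matrix = matrix_pairs(purifications)
```

In this toy example the spoke model yields three pairs and the matrix model five; the extra pairs (B-C and A-D) are prey-prey associations that the spoke model deliberately excludes.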

BGS— The  BGS  interaction  data  set  is  a  manually  curated 
set  of  high-confidence  physical  binary  interactions  that  rep¬ 
resent  direct  protein  associations,  rather  than  indirect  ones 
(11).  This  interaction  set  has  been  shown  to  have  considerable 
overlaps  with  high-throughput  Y2H  data  sets  (11). 

High-Confidence  AP/MS  Protein  Interaction  Network—  Our 
IDBOS  procedure  for  AP/MS  data  can  be  summarized  as 
follows  (14):  For  a  given  affinity  purification  data  set  in  which 
individual purifications are specified, we counted, for each
unique protein pair $i$ and $j$, the total number of times they
co-occurred in the same purification, $o_{ij}$. This analysis
corresponds to a matrix enumeration of possible interacting protein
pairs within each purification. We then constructed $10^6$
randomized, or shuffled, purification sets and computed average
shuffled co-occurrences, $\bar{o}_{ij}$, and associated standard
deviations, $\sigma_{ij}$, over these sets. A shuffled purification
set was constructed by shuffling, or exchanging, pairs of prey
proteins in the data set. The co-occurrence score ($CS_{ij}$) for
each protein pair was then determined as the Z-score of the
observed co-occurrences given by:

$$CS_{ij} = \frac{o_{ij} - \bar{o}_{ij}}{\sigma_{ij}} \qquad \text{(Eq. 1)}$$


In order to gauge the significance of these scores, we also
constructed the scores associated with randomly shuffled pairs
themselves. First, we constructed an additional $10^5$ shuffled
sets in the same manner as that described above. Second, for
each shuffled set, we determined the Z-scores for protein pairs
having a shuffled co-occurrence of greater than one as:

$$CS^{n}_{ij} = \frac{o^{n}_{ij} - \bar{o}_{ij}}{\sigma_{ij}} \qquad \text{(Eq. 2)}$$

where $o^{n}_{ij}$ (> 1) denotes the co-occurrence of proteins $i$
and $j$ in the $n$th shuffled set, and $\bar{o}_{ij}$ and
$\sigma_{ij}$ denote the mean co-occurrences and standard
deviations, respectively, determined from the shuffled sets as in
Equation 1. We can then determine the CS-score cutoff that yields
a particular false positive rate by comparing the normalized
experimental ($P_E$) and random ($P_R$) score distributions
generated from the data. For a given score threshold $\zeta$, we
computed the fractions of protein pairs in the commensurate random
($f_R$) and experimental ($f_E$) distributions that have a higher
score than $\zeta$ as:

$$f_R(\zeta) = \int_{\zeta}^{\infty} P_R(x)\,dx \qquad \text{(Eq. 3)}$$

and

$$f_E(\zeta) = \int_{\zeta}^{\infty} P_E(x)\,dx. \qquad \text{(Eq. 4)}$$
We then approximated the false positive rate (FPR) as the
ratio of these fractions at a given score threshold $\zeta$, as:

$$\mathrm{FPR}(\zeta) = \frac{f_R(\zeta)}{f_E(\zeta)} \qquad \text{(Eq. 5)}$$

For example, for a false positive rate of 5% we computed the
corresponding score cutoff $\zeta_{0.05}$ for the high-confidence
AP/MS data set to be 5.95. We compiled high-confidence
data sets at different false positive rates by including only
interactions having higher CS scores than their respective
cutoffs.
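
The scoring procedure described above can be sketched in a few functions. This is not the IDBOS implementation (14): the shuffle below swaps arbitrary members between purifications rather than distinguishing baits from preys, it uses far fewer shuffled sets than the study, it omits pairs whose shuffled co-occurrence never varies, and all function names and defaults are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev

def co_occurrence(purifications):
    """Number of purifications in which each protein pair occurs together."""
    counts = Counter()
    for members in purifications:
        counts.update(combinations(sorted(set(members)), 2))
    return counts

def shuffled(purifications, rng, n_swaps):
    """Exchange randomly chosen proteins between purifications; sizes are kept."""
    sets = [list(m) for m in purifications]
    for _ in range(n_swaps):
        a, b = rng.sample(range(len(sets)), 2)
        i, j = rng.randrange(len(sets[a])), rng.randrange(len(sets[b]))
        # Skip swaps that would place the same protein twice in one purification.
        if sets[a][i] in sets[b] or sets[b][j] in sets[a]:
            continue
        sets[a][i], sets[b][j] = sets[b][j], sets[a][i]
    return sets

def co_occurrence_scores(purifications, n_shuffles=200, seed=0):
    """Z-score of each observed co-occurrence against shuffled purification sets."""
    rng = random.Random(seed)
    observed = co_occurrence(purifications)
    n_swaps = 10 * sum(len(m) for m in purifications)  # assumed mixing length
    samples = [co_occurrence(shuffled(purifications, rng, n_swaps))
               for _ in range(n_shuffles)]
    scores = {}
    for pair, o_ij in observed.items():
        values = [s[pair] for s in samples]  # a Counter returns 0 if absent
        sd = pstdev(values)
        if sd > 0:  # assumption: pairs with zero spread are left unscored
            scores[pair] = (o_ij - mean(values)) / sd
    return scores

def false_positive_rate(threshold, experimental_scores, random_scores):
    """Ratio of the random to the experimental fraction of scores above threshold."""
    f_e = sum(s > threshold for s in experimental_scores) / len(experimental_scores)
    f_r = sum(s > threshold for s in random_scores) / len(random_scores)
    return f_r / f_e if f_e > 0 else float("nan")
```

A score cutoff for a target false positive rate can then be found by scanning candidate thresholds and keeping the smallest one for which `false_positive_rate` falls below the target.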

Protein  Annotation  and  Essentiality  Data— The  Munich  In¬ 
formation  Center  for  Protein  Sequences  (MIPS)  protein  data 
were downloaded from the MIPS database
(ftp://ftpmips.gsf.de/yeast/catalogues/) (31). We used the top-level protein
function  (funcat/)  and  location  (subcellcat/)  annotation  labels 
to  characterize  proteins.  The  function  labels  ranged  from  me¬ 
tabolism  to  cell  differentiation,  whereas  the  location  labels 
ranged  from  extracellular  to  lipid  particles.  We  only  consid¬ 
ered  protein  complexes  (complexcat/)  identified  in  small-scale 
experiments,  by  excluding  complexes  listed  under  category 
550,  labeled  as  “complexes  by  systematic  analysis.”  Essen¬ 
tiality  data  were  merged  from  both  MIPS  (gene_disruption/) 
and the Saccharomyces Genome Database
(http://www.yeastgenome.org/) (47), where a protein was considered
essential if it was labeled as such in at least one of the data sets.

Assignment  of  a  Single  Property  to  Multiply  Annotated  Pro¬ 
teins  in  Network  Visualization—  It  is  often  the  case  that  MIPS 
annotations  assign  more  than  one  function  and/or  location  to 
a  protein.  These  “moonlighting”  proteins  presented  no  prob¬ 
lem  in  our  analyses;  however,  for  the  network  visualization,  it 
was  convenient  to  select  a  unique  annotation  item  for  each 
protein  from  each  MIPS  annotation  category.  These  were 
chosen  in  order  to  maximize  the  number  of  homogeneous 
interactions,  i.e.  those  between  proteins  having  the  same 
annotation  item.  By  assuming  that  each  homogeneous  inter¬ 
action  had  energy  -1  and  others  had  energy  0,  we  trans¬ 
formed  the  problem  into  a  typical  energy-minimization  one. 
The  total  system  energy  was  minimized  using  a  Metropolis 
Monte  Carlo  annealing  algorithm  on  a  random  initial  annota¬ 
tion  configuration  in  which  each  moonlighting  protein  was 
randomly  assigned  one  of  its  MIPS  annotation  labels.  For  a 
protein  interaction  network,  we  uniformly  annealed  the  system 
from a temperature of 50 to 0.2 (for simplicity’s sake, both
energy and temperature have the same unit) in 10,000 steps.
A step consisted of a loop of operations on every moonlighting
protein. In one operation, the current annotation label for
the protein was temporarily replaced by another, which was
randomly selected from its set of MIPS annotation labels. The
operation was accepted if the new system energy was lower,
or accepted with a probability of exp(-ΔE/T) if the new system
energy was higher, where ΔE denotes the energy change
caused by the operation and T denotes temperature. The


annotation  configuration  was  updated  when  the  operation 
was  accepted  and  remained  unchanged  when  the  operation 
was  rejected.  The  lowest-energy  configuration  in  the  simula¬ 
tion  was  chosen  for  the  visualization  of  this  data  set. 
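
The annealing procedure can be sketched as follows. This is a minimal illustration rather than the code used for the figures: "uniformly annealed" is read here as a linear temperature schedule, self-interactions are ignored when computing energy changes, every interacting protein is assumed to have at least one label, and the function names and toy network are hypothetical.

```python
import math
import random

def energy(interactions, assignment):
    """Energy of -1 for every interaction whose two proteins share a label."""
    return -sum(assignment[a] == assignment[b] for a, b in interactions)

def anneal_labels(interactions, labels, t_start=50.0, t_end=0.2,
                  steps=10_000, seed=0):
    """Pick one label per protein so that homogeneous interactions are maximized."""
    rng = random.Random(seed)
    options = {p: sorted(opts) for p, opts in labels.items()}
    neighbors = {p: [] for p in labels}
    for a, b in interactions:
        if a != b:  # a self-interaction is always homogeneous, so it never changes
            neighbors[a].append(b)
            neighbors[b].append(a)
    moonlighting = [p for p in options if len(options[p]) > 1]
    assignment = {p: rng.choice(options[p]) for p in options}
    current = energy(interactions, assignment)
    best, best_energy = dict(assignment), current
    for step in range(steps):
        # Assumed linear schedule from t_start down to t_end.
        t = t_start + (t_end - t_start) * step / max(1, steps - 1)
        for p in moonlighting:
            old, new = assignment[p], rng.choice(options[p])
            # Only interactions involving p change, so evaluate the change locally.
            delta = sum((assignment[q] == old) - (assignment[q] == new)
                        for q in neighbors[p])
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                assignment[p] = new
                current += delta
                if current < best_energy:
                    best, best_energy = dict(assignment), current
    return best, best_energy

# Hypothetical toy network: A and D each carry two candidate labels.
labels = {"A": {"x", "y"}, "B": {"x"}, "C": {"y"}, "D": {"x", "y"}}
interactions = [("A", "B"), ("A", "D"), ("D", "B"), ("A", "C")]
best, best_energy = anneal_labels(interactions, labels, steps=200)
```

Tracking the lowest-energy configuration seen during the run, rather than returning the final one, mirrors the selection of the lowest-energy configuration described in the text.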

Connection  Z-Scores  of  Labels  of  Interacting  Proteins— To 
gauge  the  statistical  connections  between  the  functional  cat¬ 
egories  assigned  to  the  constituent  proteins  in  an  interaction, 
we  compared  the  originally  annotated  interactions  with  shuf¬ 
fled  annotations.  During  a  shuffle,  all  annotations  assigned  to 
a protein were kept as a single package. In a shuffle
simulation, each annotated protein was randomly chosen to switch
its annotation package with another protein. For each
high-confidence interaction data set, we carried out 10,000
simulations. The connection significance between annotation
categories $i$ and $j$ was represented as a Z-score, as follows:

$$Z_{ij} = \frac{C_{ij} - \bar{C}_{ij}}{\sigma_{ij}} \qquad \text{(Eq. 6)}$$

where $C_{ij}$ denotes the number of interactions between a
protein annotated with category $i$ and another annotated with
category $j$ in the data set, $\bar{C}_{ij}$ represents the average
$C_{ij}$ over the 10,000 simulations, and $\sigma_{ij}$ is the
corresponding standard deviation.
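
A compact sketch of this annotation-shuffling test is given below. It is an illustration under assumptions, not the original code: the pairwise switching of annotation packages is approximated by a random permutation of packages among annotated proteins, each interaction contributes at most once to a given category pair, far fewer simulations are run than in the study, and all names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def category_pair_counts(interactions, annotations):
    """Count, for each category pair, the interactions that connect them."""
    counts = Counter()
    for a, b in interactions:
        # A set ensures one interaction adds at most one count per category pair.
        counts.update({tuple(sorted((ca, cb)))
                       for ca in annotations.get(a, ())
                       for cb in annotations.get(b, ())})
    return counts

def connection_z_scores(interactions, annotations, n_simulations=200, seed=0):
    """Z-score of each observed category-pair count against shuffled annotations."""
    rng = random.Random(seed)
    observed = category_pair_counts(interactions, annotations)
    proteins = sorted(annotations)
    packages = [annotations[p] for p in proteins]
    samples = []
    for _ in range(n_simulations):
        rng.shuffle(packages)  # each protein keeps a whole annotation package
        samples.append(category_pair_counts(interactions,
                                            dict(zip(proteins, packages))))
    z = {}
    for pair in set(observed).union(*samples):
        values = [s[pair] for s in samples]
        sd = pstdev(values)
        if sd > 0:  # assumption: pairs with zero spread are left unscored
            z[pair] = (observed[pair] - mean(values)) / sd
    return z

# Hypothetical toy data: two fully connected groups of four proteins each.
group_x, group_y = ["a", "b", "c", "d"], ["e", "f", "g", "h"]
interactions = [(p, q) for g in (group_x, group_y)
                for i, p in enumerate(g) for q in g[i + 1:]]
annotations = {p: {"x"} for p in group_x}
annotations.update({p: {"y"} for p in group_y})
z = connection_z_scores(interactions, annotations)
```

In the toy data every interaction connects proteins of the same category, so same-category pairs score above zero and the mixed pair scores below zero whenever a score is defined.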